Abstract:

Object  recognition  is  a  practical  problem  with  a  wide  variety  of  potential  applications.  Recognition 
becomes  substantially  more  difficult  when  objects  have  not  been  presented  in  some  logical,  “posed” 
manner  selected  by  a  human  observer.  We  propose  to  solve  this  problem  using  active  object 
recognition,  where  the  same  object  is  viewed  from  multiple  viewpoints  when  it  is  necessary  to 
gain  confidence  in  the  classification  decision.  We  demonstrate  the  effect  of  unposed  objects  on  a 
state-of-the-art  approach  to  object  recognition,  then  show  how  an  active  approach  can  increase 
accuracy.  The  active  approach  works  by  attaching  confidence  to  recognition,  prompting  further 
inspection  when  confidence  is  low.  We  demonstrate  a  performance  increase  on  a  wide  variety  of 
objects  from  the  RGB-D  database,  showing  a  significant  increase  in  recognition  accuracy. 


1  Introduction 

State-of-the-art approaches to visual recognition have focused mostly on
situations in which objects are “posed” (i.e., the camera angle, lighting, and
position have been chosen by an observer). When conditions become more
variable, the ability to visually recognize objects quickly decreases. In one
prominent example demonstrating this effect, Pinto et al. [Pinto et al., 2008]
produced very good accuracy classifying objects from the Caltech-101 dataset
[Fei-Fei et al., 2004], but their state-of-the-art approach was reduced to
performing at chance when variation was introduced. Specifically, this meant
viewing objects at any arbitrary pan, tilt, scale, and rotation (both in plane
and in depth).

Unfortunately, such variability is common in the objects that we see scattered
throughout our environment. In some cases (see Figure 1(a)) it may be difficult
for even the most robust visual object recognition approach to recognize an
object, and the performance of the recognition system degrades as a result.
Figure 1(a) shows two objects from the RGB-D dataset: from left to right, a dry
battery and a hand towel. In both cases, the object could be mistaken for a
member of a similar class. For example, the dry battery could easily be
mistaken for a flashlight or a pack of chewing gum, and the hand towel could
easily be confused with a 3-ring binder.

Figure 1(b) shows the accuracy of Leabra (described further in Section 3.1)
recognizing a dry battery over a range of different pan angles, with a slightly
different camera tilt. While performance is generally good, there is a point at
which it drops significantly: a system that had been recognizing objects with
an accuracy of about 90% suddenly decreases to an accuracy of 30% when the pan
and tilt of the object are modified. An image from this region is shown in
Figure 1(a).

The strategy of improving object recognition through multiple viewpoints is
referred to as active object recognition [D. Wilkes, 1992]. Several authors
[Denzler and Brown, 2002, Farshidi et al., 2009, LaPorte and Arbel, 2006] have
proposed probabilistic frameworks for active object recognition. These
frameworks serve both to combine multiple viewpoints and to incorporate prior
probabilities. However, most have been evaluated on only a small number of
objects, using simple recognition schemes chosen specifically to highlight the
benefits of active recognition.

We demonstrate the benefit of active object recognition for improving the
results of a state-of-the-art approach, specifically in areas where performance
is affected by the pose of an object. We recognize objects using Leabra 1,
which is a cognitive computational neural network simulation of the visual
cortex. The neural networks have hidden layers designed to mimic the
functionality of the primary visual cortex (V1), the visual area (V4), and the
inferior temporal cortex (IT). We extend Leabra by adding a confidence measure
to the resulting classification, then use active investigation when necessary
to improve recognition results.

Figure 1: Images from the RGB-D dataset.

We demonstrate the performance of our system using the RGB-D dataset
[Lai et al., 2011]. The RGB-D dataset contains a full 360° range of yaw and
three levels of pitch. We perform active object recognition on 115 instances of
28 object classes from the RGB-D dataset.

The remainder of the paper is organized as follows. We present related work in
the field of active object recognition in Section 2. We discuss our approach in
Section 3, then present experimental results in Section 4 with concluding
remarks in Section 5.


2  Related  Work 

Wilkes and Tsotsos’ [D. Wilkes, 1992] seminal work on active object recognition
examined 8 origami objects using a robotic arm. The next best viewpoint was
selected using a tree-based matching scheme. This simple heuristic was
formalized by Denzler and Brown [Denzler and Brown, 2002], who proposed an
information-theoretic measure to select the next best viewpoint. They use the
average gray level value to recognize objects, selecting the next pose in an
optimal manner to provide the most information to the current set of
probabilities for each object. They fused results using the product of the
probabilities, demonstrating their approach on 8 objects.

1 http://grey.colorado.edu/emergent/

Jia et al. [Jia et al., 2010] demonstrated a slightly different approach to
information fusion, using a boosting classifier to weight each viewpoint
according to its importance for recognition. They used a shape model to
recognize objects, and a boosted classifier to select the next best viewpoint.
They recognized 9 objects in multiple viewpoints with arbitrary backgrounds.

Browatzki et al. [Browatzki et al., 2012] used an active approach to recognize
objects on an iCub humanoid robot. Recognition in this case was performed by
segmenting the object from the background, then recognizing the object over
time using a particle filter. The authors demonstrated this approach by
recognizing 6 different cups with different colored bottoms.


3  Methodology 

We use Leabra to recognize objects (Section 3.1). Once an object has been
evaluated by Leabra, we find the object pose (Section 3.2) and attach
confidence to the resulting classification (Section 3.5). Finally, when the
resulting classification has low confidence, we actively investigate
(Section 3.6).

3.1  Leabra 

The architecture of a Leabra neural network is broken into three different
layers, each with a unique function. The V1 layer takes the original image as
input, then uses wavelets [Gonzalez and Woods, 2007] at multiple scales to
extract edges. The V4 layer uses these detected edges to learn a higher-level
representation of salient features (e.g., corners, curves) and their spatial
arrangement. The features extracted at the V1 layer include multiple scales;
therefore, features extracted in the V4 layer have a sense of the large and
small features that are present in the object. The V4 layer also collapses over
location information, providing invariance to the location of the object in the
original input image. The V4 layer feeds directly into the IT activation layer,
which has neurons tuned to specific viewpoints (or visual aspects) of the
object.

Figure 2: An example of visual aspects from one level of pitch. The images show
different visual aspects, and the arrows show how these visual aspects are
connected.

3.2  Visual  Aspects 

Object pose plays an important role in recognition. We consider pose in terms
of visual aspects [Cyr and Kimia, 2004, Sebastian et al., 2004] (see Figure 2).
When an object under examination is viewed from a slightly different angle, its
appearance generally should not change. When it does not, we refer to this as a
“stable viewpoint”: both the original and the modified viewpoint belong to the
same visual aspect $V_1$. However, if this small change in viewing angle
affects the appearance of the object, we call this an “unstable viewpoint”,
representing a transition between two different visual aspects $V_1$ and $V_2$.

The human brain stores pose in a similar manner. Neurophysiological evidence
suggests that the brain has view-specific encoding [Kietzmann et al., 2009,
Frank et al., 2012]. In this encoding scheme, neurons in the IT cortex activate
differently depending on how an object appears. Referring to Figure 2, when we
look at the football in the first visual aspect, a certain set of neurons in
the IT layer activate. When we look at the football in the second visual
aspect, a different set of neurons activate.

To find visual aspects, we cluster the activations of the IT layer in the
Leabra neural network using unsupervised learning. We describe this process in
Section 3.3.


3.3  Finding  Aspects 

Classifying an object using a Leabra network produces a set of neurons that
have been activated in the IT layer. Leabra contains a total of 210 neurons in
this layer, with similar activation patterns occurring when an object is viewed
in a similar pose. We group activation patterns using unsupervised learning
through k-means clustering [Duda et al., 2000].
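
The grouping step can be illustrated with a minimal, pure-Python k-means sketch. This is not the authors' implementation: the function names, the squared Euclidean distance, the random initialization, and the fixed seed are all assumptions made for illustration, and the toy vectors stand in for the 210-dimensional IT activation patterns.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(patterns, k, iterations=50, seed=0):
    """Group activation vectors into k clusters; returns one label per vector.

    Hypothetical helper: initial centers are k patterns drawn at random.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(patterns, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each pattern joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in patterns
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(patterns, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy stand-in for IT activations: two well-separated groups of views.
toy_patterns = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = kmeans(toy_patterns, 2)
```

In this toy run the first three vectors end up in one cluster and the last three in the other, which is the sense in which each cluster plays the role of one visual aspect.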

Some care must be taken to establish the number of clusters, k, since this is
synonymous with the number of visual aspects of an object. This number varies
with the complexity of the object. For example, a simple, uniformly colored,
perfectly symmetric object such as a ball would have only one aspect; that is,
a change in viewing angle will never affect the appearance of the object.
Contrast this with a more complicated object, such as an automobile, which
would likely have a great number of visual aspects because of its complex
structure.

Figure 3: Four different visual aspects found using
clustering. 

The value of k cannot be estimated a priori, so we set this value using a
heuristic based on viewpoint stability. A small change in viewpoint (δ) should
generally not result in a new visual aspect. Therefore, when the correct value
of k has been found, most of the elements of the resulting clusters (cl) will
belong to stable viewpoints. We determine the quality of the clustering using
the heuristic shown in Eq. 1, where c represents a cluster.

$$ m(c) = \frac{\left|\{\, i \in c \;:\; \mathrm{cl}(\mathrm{pose}(i)) = \mathrm{cl}(\mathrm{pose}(i) + \delta) \,\}\right|}{|c|} \tag{1} $$

To determine the correct number of visual aspects, we set k to a large number,
then evaluate each resulting cluster. If the majority of the elements of any
cluster do not belong to stable viewpoints, k is decreased and the process is
repeated. Some visual aspects from different object classes are shown in
Figure 3.
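
The search over k can be sketched as follows, under several assumptions that are not stated in the paper: views are ordered by pan angle on a closed turntable loop, the small change δ is one step to the next view, and a cluster is rejected when fewer than half of its members are stable. The names `stable_fraction`, `choose_k`, and `contiguous_chunks` are hypothetical, and `contiguous_chunks` is only a stand-in for a real clustering routine such as k-means.

```python
def stable_fraction(labels, cluster):
    """Eq. 1 sketch: share of views in `cluster` whose next view
    (one step further around the turntable) has the same cluster label."""
    n = len(labels)
    members = [i for i in range(n) if labels[i] == cluster]
    if not members:
        return 0.0
    stable = sum(1 for i in members if labels[(i + 1) % n] == cluster)
    return stable / len(members)

def choose_k(patterns, cluster_fn, k_start):
    """Start from a large k and decrease it until no cluster has a
    majority of unstable viewpoints. `cluster_fn(patterns, k)` must
    return one label per pattern."""
    for k in range(k_start, 0, -1):
        labels = cluster_fn(patterns, k)
        if all(stable_fraction(labels, c) >= 0.5 for c in set(labels)):
            return k, labels
    raise ValueError("k_start must be at least 1")

def contiguous_chunks(patterns, k):
    """Stand-in clustering for the demo: split the ordered views
    into k contiguous runs."""
    n = len(patterns)
    return [(i * k) // n for i in range(n)]

# Twelve ordered views; the search starts at k = 12 and works downward.
best_k, best_labels = choose_k(list(range(12)), contiguous_chunks, 12)
```

With this stand-in, every k from 12 down to 7 produces at least one single-view cluster whose neighbour has a different label, so those values are rejected and the search settles on k = 6.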


3.4  Distinctiveness of Visual Aspects

A basic tenet of active object recognition is that some viewpoints are more
distinctive than others. In this section, we establish the distinctiveness of
each visual aspect using StrOUD [Barbara et al., 2006], a test which evaluates
the distinctiveness (or conversely, “strangeness”) of the members of each
class. The strangeness of a member (m) is evaluated as the ratio of the
distance to other objects of that class c over the distance to all points of
other classes $c^{-1}$ (Eq. 2). In practice, we evaluate this by selecting the
k smallest distances.


$$ \mathrm{str}(m, c) = \frac{\sum_{i=1}^{k} \mathrm{distance}_{c}(i, m)}{\sum_{i=1}^{k} \mathrm{distance}_{c^{-1}}(i, m)} \tag{2} $$
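
As a small worked sketch of Eq. 2, the function below sums the k smallest distances to same-class points and divides by the sum of the k smallest distances to points of other classes. The Euclidean metric, the function names, and the toy points are assumptions for illustration; the paper does not specify the distance measure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def strangeness(m, same_class, other_classes, k=3):
    """Eq. 2 sketch: ratio of the k smallest distances from m to points of
    class c over the k smallest distances to points of all other classes.
    Assumes at least one other-class point lies at a nonzero distance."""
    near_same = sorted(euclidean(m, p) for p in same_class)[:k]
    near_other = sorted(euclidean(m, p) for p in other_classes)[:k]
    return sum(near_same) / sum(near_other)

# Toy example: m sits close to its own class and far from the others.
m = [0.0, 0.0]
same = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
others = [[10.0, 0.0], [0.0, 10.0], [0.0, 20.0]]
low = strangeness(m, same, others, k=2)   # (1 + 1) / (10 + 10)
high = strangeness(m, others, same, k=2)  # roles swapped
```

A point that is near its own class and far from the rest gets a small ratio, while swapping the roles of the two sets gives a large one, matching the reading that distinctive points have low strangeness.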


The sum of the distances to objects in the same class c should be much smaller
than the sum of the distances to other classes $c^{-1}$. Therefore, a
distinctive data point would have very low strangeness. When referring to
visual aspects (s), the probability that we have correctly identified object
(o) in visual aspect s is determined using Eq. 3.

P(»|») = < jMgifM (3)

3.5  Recognition from a Single Viewpoint

To recognize object o, we use the IT activation pattern from Leabra (a), then
compare this against known classes. We compute the strangeness of the object
with respect to each class using Eq. 2. The probability of recognition is
conditionally dependent upon the distinctiveness of the visual aspect s, as
well as the confidence that the object belongs to visual aspect s. The
probability that we have recognized an object of class o is
$p(o_{ix} \mid a_x, s_x)$, for object i using image x.

$$ p(o_{ix} \mid a_x, s_x) = p(o_{ix} \mid a_x)\, p(o_{ix} \mid s_x) \tag{4} $$

$$ p(o_{ix} \mid a_x) = \alpha\, p(a_x \mid o_{ix})\, p(o_{ix}) \tag{5} $$

$$ p(o_{ix} \mid s_x) = \alpha\, p(s_x \mid o_{ix})\, p(o_{ix}) \tag{6} $$

Eq. 5 can be interpreted as the probability that we have observed the object
given a particular activation pattern. If the activation pattern observed is
quite similar to known activation patterns for object $o_i$, we expect the
probability to be high. Similarly, Eq. 6 can be interpreted as the general
confidence of recognizing object o in estimated visual aspect s. Combining the
two (Eq. 4) produces an uncertainty measure that accounts for both the
similarity of activation patterns and the confidence in the visual aspect.

3.6  Active Recognition

When confidence is low, a single image may not be sufficient to correctly
recognize the object. In these cases, we make a small local movement to view
the object from a slightly different perspective, then combine the
measurements. The probability that the object belongs to class i, as was
suggested in [Denzler and Brown, 2002], is estimated using the product of all
measurements that have been taken over time (n). This also has the potential
for incorporating a prior probability, which we have set to a uniform
probability.

$$ P(i) = \prod_{x=1}^{n} p(o_{ix} \mid a_x, s_x) \tag{7} $$
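
The single-view combination of Eq. 4 and the product fusion of Eq. 7 can be sketched together as a stopping loop. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the per-class probabilities are made-up toy numbers, and renormalizing the product after every view (so that it can be compared with a confidence threshold) is our assumption. All probabilities are assumed to be strictly positive.

```python
def normalize(scores):
    """Rescale per-class scores so they sum to one."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

def single_view_posterior(p_given_activation, p_given_aspect):
    """Eq. 4 sketch: multiply the activation-based and aspect-based
    posteriors class by class (renormalization is an assumption)."""
    combined = {c: p_given_activation[c] * p_given_aspect[c]
                for c in p_given_activation}
    return normalize(combined)

def active_recognition(views, threshold=0.99999):
    """Eq. 7 sketch: fuse views by product until one class reaches
    `threshold`. Returns (label, confidence, number of views used)."""
    fused = None
    used = 0
    for p_act, p_asp in views:
        used += 1
        view = single_view_posterior(p_act, p_asp)
        if fused is None:
            fused = view
        else:
            fused = normalize({c: fused[c] * view[c] for c in fused})
        best = max(fused, key=fused.get)
        if fused[best] >= threshold:
            break  # confident enough; no further movement needed
    if fused is None:
        raise ValueError("at least one view is required")
    best = max(fused, key=fused.get)
    return best, fused[best], used

# Toy run: three identical, mildly informative views of the same object.
toy_view = ({"battery": 0.6, "flashlight": 0.4},
            {"battery": 0.5, "flashlight": 0.5})
label, confidence, views_used = active_recognition([toy_view] * 3,
                                                   threshold=0.65)
```

In the toy run a single view gives a confidence of 0.6 for the battery, which is below the 0.65 threshold, so a second view is fused and the loop stops there. With a uniform prior, as in the paper, the prior terms cancel under normalization, which is why they do not appear in the sketch.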


4  Experimental  Results 

We experimentally validate our approach using the RGB-D dataset
[Lai et al., 2011]. This particular dataset was selected due to its large
number of object classes, many instances of each class, and the range of poses
in which each instance was imaged. A few examples of training images are shown
in Figure 5. Our experiments are conducted using 115 instances of 28 object
classes. RGB-D has images of objects viewed from three different levels of
camera pitch, rotating the object a full 360° at each pitch level. We use 39
randomly selected images per object for training (approximately 5% of the
images). One third of the remaining images are used for validation, and the
rest are used for testing (52,404 images).

We extract the object using the foreground mask provided in the RGB-D dataset.
The foreground mask represents the part of the region that is not on the table,
as estimated using the depth information provided by the Kinect. The size of
the object was normalized in the same manner as was previously described in
[Pinto et al., 2008]. The purpose of foreground extraction and size
normalization is to remove irrelevant size cues and to provide a measure of
scale invariance.

Table 1 shows the recognition rates for Leabra (i.e., single-viewpoint or
static object recognition) and for active object recognition. Active object
recognition has been set for very high confidence (p = 0.99999) and therefore
will only recognize an object when it is extremely confident in the result.
Note that this is a confidence on a decision-level basis, and does not
necessarily predict the overall performance of the system, as performance is
driven by the variability of the testing data.


Figure 4: Frequency and number of positions used during the active object
recognition process.

During active investigation, objects are examined at 2.4 positions on average
before they are recognized. The frequency of the number of positions used
during examination is shown in Figure 4.

Across all of the objects, the static approach has a precision of 90.55%, and
the active approach has a precision of 96.81%. Furthermore, the standard
deviation of precision differs greatly between the approaches: 9.27% for the
static approach versus 4.95% for the active approach. This indicates that not
only is the overall accuracy of the system improving, but the number of objects
with a low level of accuracy is also decreasing.


5  Discussion 

State-of-the-art approaches to object recognition have been demonstrated to
perform very well on posed objects. We have shown that unposed objects can be
more difficult to recognize, particularly in degenerate viewpoints. Further, an
active strategy can boost the performance of the system even when considering a
simple approach to next best viewpoint selection. Using only a random movement
strategy, we demonstrated an improvement in precision of about 6 percentage
points (from 90.55% to 96.81%) without significantly impacting the recognition
speed of the system (requiring only 2.4 positions on average).